Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools

Abstract
Objective: To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora.
Materials and Methods: Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English.
Results: The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision.
Discussion: Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools.
Conclusion: This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.


Introduction
Electronic health records (EHRs) have become an invaluable source of real-world data for observational research, offering insights into disease prevalence, patient outcomes, and treatment effectiveness. 1,2 While structured data, such as coded conditions, measurements, and prescriptions, are frequently used for analysis, a significant portion of valuable patient information remains locked within free text, such as nursing and physician notes. 3,4 The extraction of information from these unstructured data in a structured manner, such as standardized clinical concepts from the Unified Medical Language System (UMLS), 5 can greatly enhance observational research by providing additional rich, detailed clinical information at scale. 4,6,7 Numerous tools for this natural language processing (NLP) task of clinical concept extraction, which consists of both named entity recognition (NER) and named entity linking (NEL), have been developed for English clinical texts, 8-10 including tools such as cTAKES, 11 MetaMap, 12 QuickUMLS, 13 and MedCAT, 14 cloud-based tools, 15 and tools using generative large language models (LLMs). 16
However, the need for concept extraction tools and validating these tools extends beyond English, 10 particularly with the rise of real-world data utilization in observational clinical research across the multilingual continent of Europe, 17 as seen in initiatives like the European Medical Information Framework (EMIF), 18 the European Health Data & Evidence Network (EHDEN), 19 and the Data Analytics and Real World Interrogation Network (DARWIN EU). 20
Utilizing unstructured data in large-scale analyses within standardized frameworks, such as the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), 21,22 highlights the importance of reliable information extraction for different languages. Nevertheless, the landscape of concept extraction tools for relatively small non-English languages such as Dutch remains underdeveloped, and the limited number of tools currently available for Dutch clinical text, including adapted versions of QuickUMLS 6 and MedCAT, 23 have not been publicly evaluated. At the same time, while it is not uncommon for extraction tools to lack validation, 24 most English extraction tools are validated using various public corpora annotated with clinical concepts, for example, i2b2, 25 ShARe/CLEF, 26 and MedMentions. 27 While benchmarks exist for various other Dutch NLP tasks, 28 the absence of Dutch annotated clinical corpora poses a significant challenge for validation and comparison of the extraction tools in this language. 29,30
Creating an annotated clinical corpus in any language is resource-intensive, requiring significant labor to manually annotate numerous clinical texts in great detail. 10,31 [33][34] For instance, a recent study demonstrated that LLM data generation can produce clinical texts in German, with entities annotated according to broad semantic categories. 35 Besides synthesizing new data, a scalable option that relies on the models' creativity and domain knowledge, LLMs also enable data augmentation, notably by translating existing English corpora into other languages. 34,36 While machine translation using LLMs has significantly improved in recent years, 28,37 merely translating the clinical texts of an annotated corpus is insufficient because the word locations of clinical entities within the text shift during translation, causing the loss of annotation information tied to specific text locations. 38
Although these annotations could be manually repositioned or aligned using secondary word alignment software after translation, 36,38,39 we propose a method that preserves annotation locations during translation by embedding the annotations within the text before translation and retrieving them afterward.
Our study investigates the feasibility of validating non-English, specifically Dutch, concept extraction tools using English-annotated corpora translated via machine translation with embedded annotations. We evaluate 2 English concept extraction tools that were adapted to Dutch on 2 English annotated corpora and their Dutch translations, and on a multilingual annotated corpus. We compare the concept extraction performance of the tools between the languages.

Experimental setup
The experimental setup consisted of 2 main parts. The first part involved the corpus translation and preparation phase, where 3 publicly available annotated corpora were standardized to the same format. This included translating English corpora into Dutch while preserving annotations and creating training and test sets. The second part involved applying and evaluating 2 concept extraction tools on the test sets; the one tool that supported supervised training also used the training sets. The setup is visualized in Figure 1.

Corpora
The annotated corpora used in our study include the MedMentions corpus (MM), 27 the corpus from the ShARe/CLEF eHealth evaluation lab task 2 (SC), 26 and the multilingual Mantra corpus (MT). 29 MM is a comprehensive biomedical corpus containing 4392 abstracts from PubMed, annotated with concepts across a wide range of biomedical semantic types. SC is a corpus derived from 432 clinical notes and is designed to facilitate tasks related to understanding clinical text, including entity recognition and normalization. The multilingual MT corpus provides annotations of 200 short texts from different parallel corpora (Medline abstract titles [MDL] and sentences of drug labels from the European Medicines Agency [EMA]), in multiple languages, including English and Dutch. All 3 corpora feature annotations that link text spans to a UMLS concept, identified by a Concept Unique Identifier (CUI). To facilitate uniform analysis, all corpora were standardized into the same tabular format. This involved separating the text documents (Attributes: DocumentId, Text) and the concept annotations (Attributes: DocumentId, CUI, SpanStart, SpanEnd, SpanText). The SC corpus is pre-partitioned into training and test sets, whereas for MM and MT, we randomly allocated 80% of the data for training and 20% for testing.
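The standardized layout and the random split can be illustrated with a minimal sketch. The attribute names follow the ones listed above; the example document, the convention that SpanEnd is an exclusive character offset, the `split_documents` helper, and the fixed seed are assumptions made for illustration and are not taken from the study's code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    document_id: str  # DocumentId
    text: str         # Text


@dataclass(frozen=True)
class Annotation:
    document_id: str  # DocumentId
    cui: str          # CUI
    span_start: int   # SpanStart, character offset (inclusive)
    span_end: int     # SpanEnd, character offset (exclusive; an assumption)
    span_text: str    # SpanText


def split_documents(doc_ids, train_fraction=0.8, seed=0):
    """Randomly allocate document ids to a training and a test set."""
    ids = sorted(doc_ids)             # sort first so the split is reproducible
    random.Random(seed).shuffle(ids)  # hypothetical fixed seed
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Illustrative record, not taken from any of the corpora.
doc = Document("d1", "Patient denies chest pain.")
ann = Annotation("d1", "C0008031", 15, 25, "chest pain")

train_ids, test_ids = split_documents([f"d{i}" for i in range(10)])
```

Splitting on document ids rather than on annotations keeps all annotations of one document in the same partition.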

Corpus translation
To develop the Dutch corpora of annotated clinical texts, we used the English annotated corpora as a starting point. Directly translating an English text would allow us to create a Dutch text, but the exact locations of the annotations would be lost. To address this, we propose a method for preserving the location of annotated concepts through a process of 3 steps (see Table 1 for an example). First, annotations are integrated directly into the clinical text by enclosing the text span and the CUI in square brackets, ie, [[text span] [CUI]]. Next, this text with embedded annotations is translated using machine translation, keeping the annotations intact. Finally, the annotations are extracted from the translated text using a simple regular expression pattern, resulting in separate text documents and annotations again.
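The embedding and retrieval steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the regular expression, the `embed` and `extract` helpers, and the example sentences are hypothetical, the annotations are assumed not to overlap, and the translated sentence is written by hand as a stand-in for machine translation output.

```python
import re

# Hypothetical pattern for [[span text][CUI]], allowing optional whitespace
# between the two inner bracket pairs. The pattern used in the study may differ.
EMBEDDED = re.compile(r"\[\[([^\[\]]+)\]\s*\[(C\d{7})\]\]")


def embed(text, annotations):
    """Insert (start, end, cui) annotations into the text as [[span][CUI]].

    Assumes the annotations do not overlap.
    """
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, cui in sorted(annotations, reverse=True):
        text = f"{text[:start]}[[{text[start:end]}][{cui}]]{text[end:]}"
    return text


def extract(text):
    """Strip embedded annotations.

    Returns the clean text and a list of (start, end, cui, span_text) tuples
    whose offsets refer to the clean text.
    """
    parts, annotations, pos, offset = [], [], 0, 0
    for match in EMBEDDED.finditer(text):
        before = text[pos:match.start()]
        parts.append(before)
        offset += len(before)
        span, cui = match.group(1), match.group(2)
        annotations.append((offset, offset + len(span), cui, span))
        parts.append(span)
        offset += len(span)
        pos = match.end()
    parts.append(text[pos:])
    return "".join(parts), annotations


english = "Patient denies chest pain."
embedded = embed(english, [(15, 25, "C0008031")])

# Hand-written stand-in for the output of a translation service.
translated = "Patiënt ontkent [[pijn op de borst][C0008031]]."
dutch, dutch_annotations = extract(translated)
```

Because the offsets are recomputed while the brackets are stripped, the retrieved annotations point into the clean Dutch text without a separate alignment step.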
To experimentally assess the impact of translation in this process, we utilized and compared 2 different machine translation services: the Cloud Translation API from Google (referred to as Google) and the GPT-4 Turbo (gpt-4-0125-preview) API from OpenAI (referred to as GPT). 40 Google's service offered direct machine translation of documents. In contrast, GPT, a generative text model, required a specific system prompt besides the document text to guide a zero-shot translation process:

"Translate the document to Dutch (Nederlands). Keep the formatting the same, including the in-text annotations: [[span][code]]."
This approach allowed us to compare a traditional translation service with a state-of-the-art generative text model in preserving annotated concept locations during translation. We evaluated the quality of the Google and GPT translations by assessing their similarity to each other and to manual Dutch translations in the Mantra corpus. We used the bilingual evaluation understudy (BLEU) algorithm 41 and the character n-gram F-score (chrF) 42 as translation evaluation metrics. Furthermore, to evaluate the quality of annotation preservation, we compared descriptive statistics like document size and the number of annotations before and after translation. Additionally, we quantified formatting errors, ie, the erroneous placing of brackets in the translated text, by counting the CUIs in the final translated text, as an annotation with a wrong bracket pattern is not extracted, and its CUI remains in the text.
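The formatting-error count described above can be sketched in a few lines. The helper names and the example sentence are hypothetical; the sketch assumes that CUIs take the usual UMLS form of "C" followed by 7 digits and that the source texts contain no CUI-like strings of their own.

```python
import re

# UMLS CUIs are written as "C" followed by 7 digits.
CUI = re.compile(r"\bC\d{7}\b")


def count_residual_cuis(clean_text):
    """Count CUIs still present in a text after annotation extraction.

    Each residual CUI is taken to indicate one embedded annotation whose
    bracket pattern was damaged and therefore not extracted.
    """
    return len(CUI.findall(clean_text))


def missing_annotation_share(n_before, n_after):
    """Fraction of the original annotations that were not recovered."""
    return (n_before - n_after) / n_before if n_before else 0.0


# Hand-written example: the closing brackets are misplaced, so an extraction
# pattern expecting [[span][CUI]] would not match and the CUI stays in the text.
broken = "Patiënt ontkent [[pijn op de borst] [C0008031]."
n_format_errors = count_residual_cuis(broken)
share_missing = missing_annotation_share(100, 97)
```

Missing annotations that leave no residual CUI would then be attributed to the annotation being dropped entirely rather than to a formatting error.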

Concept extraction tools
In our study, we validated and compared 2 concept extraction tools: MedSpaCy (https://github.com/medspacy/medspacy) and the Medical Concept Annotation Toolkit (MedCAT; https://github.com/CogStack/MedCAT). These Python tools, both open source and publicly available, were initially designed for extracting concepts from English texts and have been adapted for Dutch. 6,23
MedSpaCy extends the spaCy software library for clinical NLP tasks, including clinical concept extraction. It uses an adaptation of QuickUMLS, a tool for fast, unsupervised biomedical concept extraction based on string similarity and a reference concept dictionary. For English concept extraction, we utilized all UMLS concepts with English terms. For Dutch concept extraction, we used all Dutch vocabularies from UMLS. We replaced the English Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary with the Dutch SNOMED CT translation (https://github.com/mi-erasmusmc/medspacy_dutch), maintained by NICTIZ, the Dutch National IT Institute for Healthcare (https://www.snomed.org/member/netherlands). If no Dutch version of a UMLS vocabulary existed, we kept the English one, as many concepts, such as drug and laboratory concepts, are language-independent. Further details on QuickUMLS settings are provided in Table S1, and detailed information on the concept dictionaries is available in Table S2.
MedCAT is an entity recognition and linking tool, employing context similarity based on word embeddings for concept recognition and disambiguation. 14 It allows for both unsupervised training on unannotated clinical texts and supervised training on annotated texts. In this study, we used the publicly available pre-trained models for English and Dutch, developed using unsupervised training. The English model, trained on clinical notes from Medical Information Mart for Intensive Care (MIMIC) III, 43 used a subset of the English UMLS as its concept dictionary. The Dutch model is trained on medical Wikipedia articles in Dutch, incorporating all UMLS concepts with Dutch descriptions and the Dutch SNOMED CT translation. 23 Additionally, to showcase MedCAT's supervised training capabilities, we fine-tuned the pre-trained unsupervised models with supervised learning, creating a separate supervised MedCAT model on the training set of each corpus.
To summarize, our analysis involves 3 types of concept extraction models: one MedSpaCy model and one unsupervised MedCAT model for each language, plus 10 supervised MedCAT models, of which 3 are English models (one per corpus) and 7 are Dutch models (one per corpus translation).

Concept extraction evaluation
All models were applied to their respective test sets in both English and Dutch. We evaluated the extraction performance using precision, recall, and their harmonic mean, the F1 score. 27 A concept was considered correctly extracted if both the CUI and its location were accurately identified. All other predicted concepts were counted as false positives, and all unmatched reference concepts were counted as false negatives. Furthermore, to compare the overall performance across the languages and the concept extraction models, we mean-centered the evaluation results by the individual corpus. This involved subtracting the corpus mean metric value from each result, allowing for a comparison of extraction outcomes that is independent of the specific corpora. The statistical significance of the differences in metric value distribution was assessed using Bonferroni-adjusted Wilcoxon tests.
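The exact-match scoring and the mean-centering step can be written down as a short sketch. The function names, the tuple layout, and the example values are assumptions for illustration; the statistical testing is not covered here.

```python
from statistics import mean


def precision_recall_f1(predicted, reference):
    """Exact-match scores over sets of (document_id, start, end, cui) tuples.

    A prediction counts as a true positive only if the document, the span
    location, and the CUI all match a reference annotation.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)
    fp = len(predicted - reference)  # predicted but not in the reference
    fn = len(reference - predicted)  # reference annotations that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def mean_center(results):
    """Subtract the corpus mean from every result of that corpus.

    results maps (corpus, model, language) to a metric value.
    """
    by_corpus = {}
    for (corpus, _model, _language), value in results.items():
        by_corpus.setdefault(corpus, []).append(value)
    means = {corpus: mean(values) for corpus, values in by_corpus.items()}
    return {key: value - means[key[0]] for key, value in results.items()}


# Illustrative values only, not results from the study.
reference = {("d1", 15, 25, "C0008031"), ("d1", 0, 7, "C0030705")}
predicted = {("d1", 15, 25, "C0008031"), ("d1", 8, 14, "C0000000")}
p, r, f1 = precision_recall_f1(predicted, reference)

centered = mean_center({
    ("MM", "model_a", "en"): 0.6,
    ("MM", "model_a", "nl"): 0.4,
    ("SC", "model_a", "en"): 0.7,
})
```

After centering, a positive value means a result lies above the average for its corpus, which is what makes results from corpora of different difficulty comparable.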

Corpus translation
The 3 corpora, MM, SC, and MT, were processed and translated using the 2 machine translation services, Google and GPT, while preserving the annotations. Table 2 shows the translation performance on the MT corpus and the agreement between Google and GPT translations across all corpora, as measured by average BLEU and chrF scores. The machine translations were reasonably close to the Dutch references in the MT corpus, with BLEU scores ranging from 0.28 to 0.39 and chrF scores around 0.70. The agreement between Google and GPT translations was very good, with BLEU scores ranging from 0.46 to 0.58 and chrF scores between 0.75 and 0.87 across all corpora. Table S3 provides translation examples to offer a sense of the individual BLEU and chrF scores, showing that machine translations with low similarity to the more loosely translated reference could still be considered natural, be of good quality, and show high agreement with each other. The MM and MT corpora translations are publicly available for transparency and reproducibility (https://github.com/mi-erasmusmc/DutchClinicalCorpora).
The characteristics of the corpora and their translations are presented in Table 3. MM was the largest corpus and also had the highest annotation density, containing many annotations in a relatively short text, especially compared to the SC corpus. The translation of English to Dutch (with the annotations extracted) increased the number of characters, which is also visible in the existing Dutch translation in the multilingual MT corpus. The quality of in-text annotation preservation was measured by the number of missing annotations after the translation and how many of these annotations were missing due to formatting errors during translation. Overall, the preservation of annotations during translation was quite effective: the median number of annotations per document was the same as or close to that of the original English version. However, GPT translations exhibited the highest percentage of missing annotations, with 5.9% in MM and 2.8% in SC. Google translations performed better, with less than 1% missing annotations, most of which could be attributed to formatting errors. In contrast, GPT showed a very low rate of formatting errors, and its missing annotations were primarily due to the pure loss of embedded annotations during translation: annotations were ignored in the generated text while keeping the sentence structure intact. Table 4 presents examples of both types of annotation preservation errors. Upon further inspection of the missing annotations in the GPT translation, we found that GPT primarily struggled with annotations related to verbal phrases or generic nouns, such as "investigates," "performed," "comparison," and "evaluation." These missing concepts were predominantly categorized under the more generic semantic types like Functional Concept (3715 concepts, 18%), Qualitative Concept (3593, 17%), and Activity (1591, 8%), while semantic types related to diseases and medicine were barely affected. For a detailed breakdown of the missing annotations in GPT's MM translations by semantic type and its most frequent unpreserved concepts, see Table S4.

Concept extraction
In total, 30 model and corpus combinations were evaluated. The performances, measured by the F1 score (F), recall (R), and precision (P), are visualized in Figure 2. For the exact values, see Table S5. Overall, the concept extraction tools performed similarly in English and translated Dutch corpora. Despite several differences between English and Dutch within each corpus, we observed, on average, that there were no significant language differences across the different corpora (Figure 3A). Additionally, there were no performance differences between models using Dutch Google or GPT translations or between MT translations and the existing Dutch MT corpus.
Within each corpus, performance differences could be observed between the concept extraction model types. In the MM corpus, the supervised MedCAT model performed the best according to the F1 score, followed by MedSpaCy, in both languages. The differences in performance could mainly be attributed to the large differences in recall. Notably, the low recall of the unsupervised MedCAT models showed a drastic improvement with supervised learning. Furthermore, the high recall of the English MedSpaCy was not mirrored in the Dutch version, possibly due to the wide semantic type range in MM and the lower number of concepts with Dutch terms in the Dutch MedSpaCy. In the MT corpus, we also found that the supervised MedCAT model had the best performance, with a similar performance of MedSpaCy in English, followed by unsupervised MedCAT in Dutch. Again, the high performance of MedSpaCy in English could be attributed to its high recall, as precision differences were relatively small. Conversely, in the SC corpus, the differences between the model types were mainly due to differences in their precision. Both unsupervised and supervised MedCAT models performed similarly, with the unsupervised model slightly outperforming the supervised one in the English corpus. Although supervised training improved recall, it reduced precision. Figure 3B presents the mean-centered performance across corpora, showing that, on average, supervised models performed best. While MedSpaCy models had a high recall similar to supervised MedCAT, their precision was consistently lower.

Evaluation of concept extraction using translated corpora
This study explored the feasibility of validating Dutch concept extraction tools on annotated corpora derived from translating existing English corpora. We validated 2 concept extraction tools in Dutch and English using 1 annotated multilingual corpus and 2 annotated English corpora. The results demonstrated the effective generation of Dutch annotated corpora through our proposed method, which preserves annotation location through translation, facilitating rapid, efficient, and accurate creation of Dutch corpora annotated with clinical concepts, without necessitating further postprocessing for text alignment. 36
We successfully utilized 2 machine translation services, Google and GPT, for corpus translation. While both provided good quality translations, Google encountered more issues with annotation formatting, whereas GPT translations had a larger number of missing annotations, primarily due to problems with verbal phrases and generic nouns, affecting the preservation of annotations. This issue was particularly notable in the MM corpus, which has a high annotation density. The exact reason for these missing annotations is unclear but may be related to these phrases and nouns not typically being annotated as clinical entities.
The translation process from English to Dutch did not significantly impact the performance of concept extraction, with models showing comparable effectiveness across languages and corpora. Moreover, no significant differences were observed in model performance between Google- and GPT-translated corpora or between the Dutch MT corpus and the MT translations. These results confirm the feasibility of accurately translating existing annotated corpora for multilingual use, as demonstrated by the method's effectiveness for Dutch, which is broadly applicable and expected to perform well across various languages.
When comparing the performance of the concept extraction models across the different corpora, we found that the supervised MedCAT model generally performed best. The fine-tuning of the unsupervised models using supervised learning showed much improvement, especially in the MM and MT corpora. These findings for MedCAT align with those reported by the authors. 14 MedSpaCy demonstrated a high recall across all corpora but suffered from lower precision, likely due to its reliance on an extensive concept database that led to the extraction of many correct concepts alongside numerous unannotated ones.
Overall, this research enhanced our understanding of the challenges and opportunities in creating multilingual annotated clinical corpora and validating non-English concept extraction tools, contributing to clinical NLP and data harmonization to improve observational research.

Biomedical settings
In this study, we included 3 different annotated biomedical corpora commonly used for evaluating concept extraction. However, the specific settings of these corpora should be considered when evaluating concept extraction for practical use. For example, SC contains clinical notes from an American hospital EHR, which might differ from its Dutch counterpart due to variances in healthcare systems and practices. This presents an important limitation and potential source of bias when translating corpora. The context of one corpus might not transfer well to other biomedical settings, underscoring the importance of choosing a suitable corpus. This can be assessed by comparing the corpus with texts in the target setting and language, focusing on differences in terminology, reporting practices, and healthcare delivery. Nonetheless, while translating existing English corpora provides a rapid method for generating new corpora in other languages, creating new corpora based on texts in the target language and, crucially, within the target setting remains preferable.

Translation and annotation methods
For corpus translation, we relied on 2 leading LLM services. Initially, we explored more translation services but decided to focus on Google and GPT to keep the scope manageable and the narrative clear. Google was chosen for its widespread recognition and GPT for its state-of-the-art text interpretation and generation capabilities, alongside their relative cost-effectiveness. We also explored various methods for embedding annotations in text, such as using curly or angle brackets and Standard Generalized Markup Language (SGML). We found square brackets easy to implement and effective, with fewer formatting errors during translation compared to SGML methods. Square brackets also appeared less frequently in the original text compared to other types of brackets, simplifying retrieval. However, the choice of embedding method might depend on the data, and a formal comparison could be conducted in future work.

Machine translation evaluation
We evaluated the accuracy of machine translations in the MT corpus, which included manual Dutch translations for reference. We did not compare MM and SC corpora translations against a manual reference but observed a high agreement between Google and GPT translations. Moreover, the findings from the MT corpus are likely applicable to the other corpora, as recent studies have shown similar translation performances, 44-46 and we identified no issues through empirical observation and comparison during the study. Despite this, machine translation is not infallible, and nuances or the naturalness of the text may be lost, potentially impacting the reliability of the annotated corpus. We assessed and quantified errors in the annotation-preserving translation process using simple metrics and published details on missing annotations for public scrutiny. With GPT, we only employed zero-shot prompting and observed good results despite occasional annotation losses and minimal formatting errors. However, GPT allows further improvement through techniques like prompt optimization and few-shot learning, highlighting its versatility. The Google translations exhibited more annotation formatting errors than GPT. While the Google translation model cannot be directly altered, addressing the errors with more complex regular expressions would further improve the annotation preservation.

Concept extraction tool validation
We validated and compared 2 concept extraction tools chosen for their ease of use, integration of both NER and NEL, and availability in Dutch and English. While more advanced biomedical embedding models for NEL, such as BioLORD-2023-M 47 and mSapBERT, 48 exist, evaluating only the NEL task or integrating these embedding models into MedCAT or MedSpaCy was beyond this study's scope but remains interesting for future research. The performance of the concept extraction models was not optimal, with F1 scores of the best models ranging between 0.5 and 0.7 across corpora. We used default settings for the MedSpaCy and MedCAT models, so further optimization could enhance performance. Moreover, our stringent evaluation required an exact match of the predicted and annotated CUI to be considered correct. 27 We observed that many predicted concepts closely matched the annotated concepts and, in some instances, could be considered more accurate. For instance, the word "seizure" is annotated as C0036572: Seizures, but the model extracts the similar C4229252: Seizure-like activity. Similarly, the phrase "cocaine use" is annotated as C0009171: Cocaine Abuse, but the model extracts C3496069: Cocaine Use. Therefore, a less strict evaluation method based on close concept similarity, measured by hierarchical or concept embedding distance, would likely yield higher performance.

Future work
Future work should explore the generalization of our corpus translation method to languages beyond Dutch. While translating existing corpora offers an efficient alternative to creating new ones from scratch, it would be worthwhile to compare this method with others, such as synthesizing corpora using LLMs to generate new CUI-annotated data based on examples, or combining multiple strategies. Our annotation preservation technique shows promise, but further research is needed to optimize its accuracy. Improvement could involve experimenting with various LLMs, employing one-shot or few-shot prompting, using more extensive prompts with more instructions, and fine-tuning models. A comparative analysis of our annotation preservation method with post-translation word alignment techniques, as proposed by others, 36 would also be valuable. Lastly, others can use the translated corpora from our study to evaluate different concept extraction tools, train models, and adapt our translation approach for translating other corpora in various clinical settings and NLP tasks.

Conclusion
This study demonstrated the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English. The proposed method of preserving in-text annotations during translation through language models offers a promising alternative to post-translation realignment of words. The research extended to 3 different corpora, 2 machine translation services, and 2 extraction tools, showcasing the method's versatility and potential for multilingual clinical NLP advancement. While machine translation services like Google and GPT were effective in translating annotated clinical corpora, some issues were encountered, highlighting the need for ongoing optimization and error assessment. The comparison of concept extraction models showed that the supervised MedCAT model generally performed best, with MedSpaCy demonstrating high recall but lower precision. Future work should focus on generalizing the corpus translation method to other languages, optimizing annotation preservation techniques, and exploring different strategies for embedding annotations in text. Comparative analysis with post-translation word alignment techniques and further experimentation with various language models and prompting techniques could also enhance the accuracy and efficiency of concept extraction in multilingual settings. This study contributes valuable insights into expanding clinical data augmentation and concept extraction research for non-English languages, paving the way for more extensive multilingual clinical NLP applications and advancements in the field.

Figure 2. Concept extraction performance per model type and corpus combination on the English version (blue) and the (translated) Dutch versions (orange) of the 3 main corpora, measured by the 3 metrics: F1 score (F), precision (P), and recall (R).

Figure 3. Performance comparison of (A) the different corpus languages and (B) the different concept extraction models independent of the corpora, using mean-centered metric value distributions. The dashed line indicates the mean center. Significant Bonferroni-adjusted Wilcoxon test results between the distributions are shown above the boxplots, where "*" indicates a P-value < .01. The points represent the underlying data.

Table 1. Example phrase from the MT corpus to illustrate the steps in the in-text annotation translation process.

Table 3. Characteristics of the original English and the Dutch translated corpora.

Table 2. Performance of the machine translation compared to the Dutch reference in the MT corpus and the agreement between Google and GPT machine translations across all corpora, measured by the average BLEU and chrF scores and their standard deviations (SD).

Table 4. Examples of sentences from the MM corpus with formatting errors and the loss of annotations.